-
Notifications
You must be signed in to change notification settings - Fork 777
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixed Issue: #1977 #2181
Fixed Issue: #1977 #2181
Conversation
Thank you for the PR! Is this exception raised whenever the documents are actually truncated or when you instantiate the related LLM? If it is the former, you would get an error after the clustering has been done which seems like wasted resources for the user. Would it perhaps be an idea to raise this error earlier? |
Good point! In my case right now, the error would be raised when I instantiate the relevant LLM. I will look to try and find a suitable place to raise the earlier and amend accordingly. |
Looking at an example implementation for an representation class such as llamacpp, the purpose of I might be wrong so I do apologise but wouldn't this mean the documents are truncated at representation model rather than at clustering. This means that we wouldn't be able to surface an issues after clustering. Therefore, would this mean the location of the function is reasonable? |
That is correct.
I understand your reasoning. However, that would mean that this very basic issue only gets flagged after most of the computation (and potentially hours of training) has already been done. Moreover, this is something we could already know when you initialize a given LLM. So by moving this: raise ValueError(
"Please select from one of the valid options for the `tokenizer` parameter: \n"
"{'char', 'whitespace', 'vectorizer'} \n"
"Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
) To the if tokenizer is None and doc_length is not None:
raise ValueError(
"Please select from one of the valid options for the `tokenizer` parameter: \n"
"{'char', 'whitespace', 'vectorizer'} \n"
"Alternatively if `tokenizer` is a callable ensure it has methods to encode and decode a document "
) That will show the error the moment you actually create the LLM, so before any computation has been done and users can adjust accordingly. |
@SSivakumar12 Could you create the errors as a function that is imported from |
bertopic/representation/_cohere.py
Outdated
"{'char', 'whitespace', 'vectorizer'} \n" | ||
"If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n" | ||
) | ||
_ = validate_truncate_document_parameters(self.tokenizer, self.doc_length) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you returning the _
for a specific reason? Looking at the validate_truncate_document_parameters
, there doesn't seem anything that is returned.
if tokenizer is None and doc_length is not None: | ||
raise ValueError( | ||
"Please select from one of the valid options for the `tokenizer` parameter: \n" | ||
"{'char', 'whitespace', 'vectorizer'} \n" | ||
"If `tokenizer` is of type callable ensure it has methods to encode and decode a document \n" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also include a check for the opposite? That someone does use a tokenizer but not a doc_length?
bertopic/representation/_utils.py
Outdated
else: | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this isn't necessary right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No it is not necessary but I included it to be verbose in terms of the possible edgecase. That being said, happy to remove it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove it and I'll make sure to merge it after that! Other than that, it looks great!
@SSivakumar12 Could you check the linting? It is failing there. |
Apologies, for the oversight, have resolved the linting issues locally and the build should pass on linting now |
Awesome, everything is looking good. Thank you for your work on this! |
First of all, I want to take the opportunity to thank the awesome work being done by Martin and all contributors both past and present. This is a really good tool that we have used at work and has provided us with a lot of meaningful insights so I figured that I could do my bit to contribute to a tool that has helped me previously :)
This is my first PR and I sincerely hope it is not my last especially within this repo since I really want to give back to this project.
What does this PR do?
Address the issue raised in issues #1977 which I believe has not been resolved yet.
Fixes # (issue)
I have added an edge case so that if a tokenizer isn't any of the expected patterns a ValueError is raised and produced a constructive error message to hopefully to prevent a repeat of the error. I did not update the documentation and add any tests. That being said, I am more than happy to add tests if deemed appropriate/valuable
Before submitting
- didn't update documentation but added a error message
- no but happy to do so in a separate PR since I think it doesn't fall into the scope of this current PR?